The objective of this project is to find the best model to predict which customers are likely to stop using their credit card and leave the bank's services.
# importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
)
from sklearn import metrics
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
data1=pd.read_csv('BankChurners.csv') #importing the data file
print(f'There are {data1.shape[0]} rows and {data1.shape[1]} columns in data file.')
There are 10127 rows and 21 columns in data file.
data=data1.copy()
data.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
data.info() #info of the data
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
for feature in data.columns: # Loop through all columns in the dataframe
if data[feature].dtype == 'object': # Only apply for columns with categorical strings
data[feature] = pd.Categorical(data[feature])# Replace objects with category
data.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
data.isnull().sum().sort_values(ascending=False)#checking for null values.
Education_Level 1519 Marital_Status 749 Avg_Utilization_Ratio 0 Months_on_book 0 Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Income_Category 0 Card_Category 0 Total_Relationship_Count 0 Total_Ct_Chng_Q4_Q1 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 CLIENTNUM 0 dtype: int64
data.duplicated().sum() #checking if data has duplicate values.
0
data=data.drop(['CLIENTNUM'],axis=1) #dropping the client number column because it adds nothing to the analysis.
data.describe().T #checking the overview of numerical data
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.0 | 2.346203 | 1.298908 | 0.0 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
# cdata is categorical data columns and ndata is numerical columns.
cdata_columns=['Attrition_Flag','Gender','Education_Level','Marital_Status','Income_Category','Card_Category']
ndata_columns=['Customer_Age','Dependent_count','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Credit_Limit','Total_Revolving_Bal','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1','Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']
for i in cdata_columns:
print(data[i].value_counts())
print('*'*50)
#checking the value counts of each non-numeric column.
Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 ************************************************** F 5358 M 4769 Name: Gender, dtype: int64 ************************************************** Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 ************************************************** Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64 ************************************************** Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 ************************************************** Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 **************************************************
The 'abc' value in the Income_Category column is a placeholder and should be treated as missing data.
def histogram_boxplot(feature, figsize=(15, 8), bins=None): #function from mentor sessions to plot a
#histogram and box together
"""Boxplot and histogram combined
feature: 1-d feature array
figsize: size of fig (default (9,8))
bins: number of bins (default None / auto)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.distplot(
feature, kde=False, ax=ax_hist2, bins=bins
) if bins else sns.distplot(
feature, kde=False, ax=ax_hist2
) # For histogram (kde=False so only the histogram is drawn)
ax_hist2.axvline(
feature.mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
feature.median(), color="black", linestyle="-"
) # Add median to the histogram
for i in ndata_columns: #plotting the data for all numerical columns
histogram_boxplot(data[i])
The above plots show that there are some outliers in the data which should be treated.
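The outlier treatment applied later in this notebook uses the standard 1.5×IQR rule. A minimal sketch on toy data (the array values are made up for illustration):

```python
import numpy as np

values = np.array([26, 41, 46, 52, 73, 150])  # toy ages; 150 is an obvious outlier
q75, q25 = np.percentile(values, [75, 25])
iqr = q75 - q25
upper = q75 + 1.5 * iqr  # upper whisker: 106.0 here
lower = q25 - 1.5 * iqr  # lower whisker: 4.0 here
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # → [150]
```

Values outside the whiskers are the same points a boxplot draws as individual dots.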
def perc_on_bar(z):#function from the course in order to plot bar charts with percentage
'''
plot
feature: categorical feature
the function won't work if a column is passed in hue parameter
'''
total = len(data[z]) # length of the column
plt.figure(figsize=(15,5))
#plt.xticks(rotation=45)
ax = sns.countplot(data[z],palette='Paired')
for p in ax.patches:
percentage = '{:.1f}%'.format(100 * p.get_height()/total) # percentage of each class of the category
x = p.get_x() + p.get_width() / 2 - 0.05 # width of the plot
y = p.get_y() + p.get_height() # height of the bar
ax.annotate(percentage, (x, y), size = 12) # annotate the percentage
plt.show() # show the plot
data.describe(exclude=np.number).T #check data details for categorical columns
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
perc_on_bar("Attrition_Flag") #Attrition_Flag column vs count
Most of the clients are still with the bank.
perc_on_bar("Gender") #Gender vs Count
Most of the clients are Female.
perc_on_bar("Education_Level") #Educational_Level vs Count
Most of the clients have a graduate degree.
perc_on_bar("Marital_Status") #Marital_Status vs Count
Most of the clients are Married.
perc_on_bar("Income_Category") #Income_Category vs Count
Most of the clients earn less than $40K. As mentioned before, the 'abc' category should be treated to get clean data.
perc_on_bar("Card_Category") #Card_Category vs Count
Most of the clients have Blue cards.
plt.figure(figsize=(15, 7)) #heat map
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
There is a strong positive correlation between "Months_on_book" and "Customer_Age".
There is a strong positive correlation between "Total_Trans_Amt" and "Total_Trans_Ct".
There is a positive correlation between "Avg_Utilization_Ratio" and "Total_Revolving_Bal".
There is a negative correlation between "Total_Trans_Amt" and "Total_Relationship_Count".
There is a negative correlation between "Total_Trans_Ct" and "Total_Relationship_Count".
There is a negative correlation between "Credit_Limit" and "Avg_Utilization_Ratio".
There is a negative correlation between "Avg_Open_To_Buy" and "Avg_Utilization_Ratio".
sns.pairplot(data=data, diag_kind="kde") #pairplot for the entire data set
plt.show()
# function to plot stacked bar chart from mentored learning sessions.
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
stacked_barplot(data,'Gender','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Gender All 1627 8500 10127 F 930 4428 5358 M 697 4072 4769 ------------------------------------------------------------------------------------------------------------------------
Female customers show a slightly higher tendency to drop the credit card services.
stacked_barplot(data,'Education_Level','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Education_Level All 1371 7237 8608 Graduate 487 2641 3128 High School 306 1707 2013 Uneducated 237 1250 1487 College 154 859 1013 Doctorate 95 356 451 Post-Graduate 92 424 516 ------------------------------------------------------------------------------------------------------------------------
Customers with higher education (Doctorate and Post-Graduate) show a somewhat higher tendency to drop the credit card services.
stacked_barplot(data,'Marital_Status','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Marital_Status All 1498 7880 9378 Married 709 3978 4687 Single 668 3275 3943 Divorced 121 627 748 ------------------------------------------------------------------------------------------------------------------------
Married customers hold the most credit cards, but they also account for almost half of the clients who dropped the service.
stacked_barplot(data,'Income_Category','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Income_Category All 1627 8500 10127 Less than $40K 612 2949 3561 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 abc 187 925 1112 $120K + 126 601 727 ------------------------------------------------------------------------------------------------------------------------
Clients in the lowest income bracket seem the least interested in our services.
stacked_barplot(data,'Card_Category','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Card_Category All 1627 8500 10127 Blue 1519 7917 9436 Silver 82 473 555 Gold 21 95 116 Platinum 5 15 20 ------------------------------------------------------------------------------------------------------------------------
Attrition rates are broadly similar across card categories; the counts for Gold and Platinum cards are too small to draw firm conclusions.
First, I will treat the 'abc' value in the 'Income_Category' column: I will replace 'abc' with NaN and then handle these entries together with the other null values.
data.loc[data['Income_Category'] == 'abc','Income_Category']=np.nan #replace 'abc' value with Nan
data.isnull().sum().sort_values(ascending=False)#checking for null values.
Education_Level 1519 Income_Category 1112 Marital_Status 749 Avg_Utilization_Ratio 0 Total_Ct_Chng_Q4_Q1 0 Customer_Age 0 Gender 0 Dependent_count 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Attrition_Flag 0 dtype: int64
1112 'abc' values have been converted to NaN.
## Attrition_Flag is our target column, so I will encode its values as 0 and 1 for further processing.
data["Attrition_Flag"].replace('Existing Customer', 0, inplace=True)
data["Attrition_Flag"].replace('Attrited Customer', 1, inplace=True)
### Finding outliers among all numerical columns and based on the EDA we've done before
#loop for calculating IQR in order to check for outliers and then replacing them with null values.
outlier_cols = ['Customer_Age', 'Months_on_book', 'Months_Inactive_12_mon',
                'Contacts_Count_12_mon', 'Credit_Limit', 'Avg_Open_To_Buy',
                'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct']
for x in outlier_cols:
    q75, q25 = np.percentile(data.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    upper = q75 + (1.5 * intr_qr)  # upper whisker
    lower = q25 - (1.5 * intr_qr)  # lower whisker
    data.loc[data[x] < lower, x] = np.nan
    data.loc[data[x] > upper, x] = np.nan
data.isnull().sum().sort_values(ascending=False)#checking for null values.
Education_Level 1519 Income_Category 1112 Credit_Limit 984 Avg_Open_To_Buy 963 Total_Trans_Amt 896 Marital_Status 749 Contacts_Count_12_mon 629 Total_Amt_Chng_Q4_Q1 395 Months_on_book 386 Months_Inactive_12_mon 331 Total_Trans_Ct 2 Customer_Age 2 Avg_Utilization_Ratio 0 Gender 0 Dependent_count 0 Total_Relationship_Count 0 Card_Category 0 Total_Ct_Chng_Q4_Q1 0 Total_Revolving_Bal 0 Attrition_Flag 0 dtype: int64
We can see that a considerable number of outliers were detected.
imputer = KNNImputer(n_neighbors=5) # using KNN imputation to fill the null values (including the outliers we converted to NaN).
# defining a list with names of columns that will be used for imputation and needs to be encoded
col_for_impute = [
"Education_Level",
"Income_Category",
"Marital_Status",
"Credit_Limit",
"Avg_Open_To_Buy",
"Total_Trans_Amt",
"Contacts_Count_12_mon",
"Total_Amt_Chng_Q4_Q1",
"Months_on_book",
"Months_Inactive_12_mon",
"Total_Trans_Ct",
"Customer_Age"
]
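KNNImputer fills each missing entry with the average of that feature over the k nearest rows (nearest by the non-missing features). A minimal sketch on a made-up matrix, illustrating the same `fit_transform` call used below:

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy matrix: rows are samples, columns are features; one value is missing
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0],
              [np.nan, 5.0]])
imp = KNNImputer(n_neighbors=2)
X_filled = imp.fit_transform(X)
# the missing value is the mean of column 0 over the 2 rows closest in column 1
print(X_filled[3, 0])  # → 2.5
```

Note that an imputer fitted on the training split should only be applied (via `transform`) to the validation and test splits, to avoid leaking information across splits.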
data[col_for_impute].head()#checking the data we created header
| Education_Level | Income_Category | Marital_Status | Credit_Limit | Avg_Open_To_Buy | Total_Trans_Amt | Contacts_Count_12_mon | Total_Amt_Chng_Q4_Q1 | Months_on_book | Months_Inactive_12_mon | Total_Trans_Ct | Customer_Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | High School | $60K - $80K | Married | 12691.0 | 11914.0 | 1144.0 | 3.0 | NaN | 39.0 | 1.0 | 42.0 | 45.0 |
| 1 | Graduate | Less than $40K | Single | 8256.0 | 7392.0 | 1291.0 | 2.0 | NaN | 44.0 | 1.0 | 33.0 | 49.0 |
| 2 | Graduate | $80K - $120K | Married | 3418.0 | 3418.0 | 1887.0 | NaN | NaN | 36.0 | 1.0 | 20.0 | 51.0 |
| 3 | High School | Less than $40K | NaN | 3313.0 | 796.0 | 1171.0 | 1.0 | NaN | 34.0 | 4.0 | 20.0 | 40.0 |
| 4 | Uneducated | $60K - $80K | Married | 4716.0 | 4716.0 | 816.0 | NaN | NaN | 21.0 | 1.0 | 28.0 | 40.0 |
dataimpute=data.copy() #copying data in order to start changing the data structure
# we need to pass numerical values for each categorical column for KNN imputation so we will label encode them
Education_Level = {"Uneducated": 0, "High School": 1, "College": 2, "Graduate": 3,
                   "Post-Graduate": 4, "Doctorate": 5}
dataimpute["Education_Level"] = dataimpute["Education_Level"].map(Education_Level)
#######
Income_Category = {"Less than $40K": 0, "$40K - $60K": 1, "$60K - $80K": 2, "$80K - $120K": 3,
                   "$120K +": 4}
dataimpute["Income_Category"] = dataimpute["Income_Category"].map(Income_Category)
######
Marital_Status = {"Single": 0, "Married": 1, "Divorced": 2}
dataimpute["Marital_Status"] = dataimpute["Marital_Status"].map(Marital_Status)
dataimpute.head()
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 45.0 | M | 3 | 1.0 | 1 | 2.0 | Blue | 39.0 | 5 | 1.0 | 3.0 | 12691.0 | 777 | 11914.0 | NaN | 1144.0 | 42.0 | 1.625 | 0.061 |
| 1 | 0 | 49.0 | F | 5 | 3.0 | 0 | NaN | Blue | 44.0 | 6 | 1.0 | 2.0 | 8256.0 | 864 | 7392.0 | NaN | 1291.0 | 33.0 | 3.714 | 0.105 |
| 2 | 0 | 51.0 | M | 3 | 3.0 | 1 | 3.0 | Blue | 36.0 | 4 | 1.0 | NaN | 3418.0 | 0 | 3418.0 | NaN | 1887.0 | 20.0 | 2.333 | 0.000 |
| 3 | 0 | 40.0 | F | 4 | 1.0 | NaN | NaN | Blue | 34.0 | 3 | 4.0 | 1.0 | 3313.0 | 2517 | 796.0 | NaN | 1171.0 | 20.0 | 2.333 | 0.760 |
| 4 | 0 | 40.0 | M | 3 | NaN | 1 | 2.0 | Blue | 21.0 | 5 | 1.0 | NaN | 4716.0 | 0 | 4716.0 | NaN | 816.0 | 28.0 | 2.500 | 0.000 |
X = dataimpute.drop(["Attrition_Flag"], axis=1)
y = dataimpute["Attrition_Flag"]
# Splitting data into training, validation and test set
# first we split data into 2 parts,temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation.
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
# Fit and transform the train data in order to treat nan values.
X_train[col_for_impute] = imputer.fit_transform(X_train[col_for_impute])
# Transform the validation data (using the imputer fitted on the train data, to avoid leakage)
X_val[col_for_impute] = imputer.transform(X_val[col_for_impute])
# Transform the test data
X_test[col_for_impute] = imputer.transform(X_test[col_for_impute])
#get dummies for categorical variables in order to prepare the data for modeling
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 21) (2026, 21) (2026, 21)
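One caveat with calling `pd.get_dummies` on each split separately, as above: if a rare category (e.g. Platinum cards) happens to be absent from one split, that split ends up with fewer dummy columns (here the shapes happen to agree). A hedged sketch of guarding against this with `reindex`, using made-up data:

```python
import pandas as pd

train = pd.DataFrame({"card": ["Blue", "Silver", "Gold"]})
test = pd.DataFrame({"card": ["Blue", "Blue", "Silver"]})  # no "Gold" rows
X_tr = pd.get_dummies(train, drop_first=True)  # columns: card_Gold, card_Silver
X_te = pd.get_dummies(test, drop_first=True)   # columns: card_Silver only
# align test columns to the training columns, filling absent categories with 0
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)
print(list(X_te.columns))  # → ['card_Gold', 'card_Silver']
```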
#check if there is any Null values in the data sets after KNN
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 Gender_M 0 Card_Category_Gold 0 Card_Category_Platinum 0 Card_Category_Silver 0 dtype: int64 ------------------------------ Customer_Age 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 Gender_M 0 Card_Category_Gold 0 Card_Category_Platinum 0 Card_Category_Silver 0 dtype: int64 ------------------------------ Customer_Age 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 Gender_M 0 Card_Category_Gold 0 Card_Category_Platinum 0 Card_Category_Silver 0 dtype: int64
We would like to predict which clients are likely to stop using their credit card, so we aim to maximize recall in order to minimize false negatives. A false negative here means a client who is going to drop the credit card is predicted to keep it.
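The recall/precision trade-off above can be made concrete on toy labels (the arrays below are made up for illustration; 1 = attrited, 0 = existing):

```python
from sklearn.metrics import recall_score, precision_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]  # one attrited client missed (a false negative)
print(recall_score(y_true, y_pred))     # 2 of 3 attrited clients caught ≈ 0.667
print(precision_score(y_true, y_pred))  # 2 of 3 flagged clients truly attrited ≈ 0.667
```

Maximizing recall pushes the model to miss as few attrited clients as possible, even at the cost of flagging some loyal clients.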
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
models = [] # Empty list to store all the models
# Appending models into the list. 6 models are going to be fitted as requested for the assignment.
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation Performance: Bagging: 73.56462585034015 Random forest: 71.72056514913659 GBM: 76.84563055991627 Adaboost: 76.43066457352171 Xgboost: 81.45107273678703 dtree: 73.15541601255887 Validation Performance: Bagging: 0.7699386503067485 Random forest: 0.7699386503067485 GBM: 0.8343558282208589 Adaboost: 0.8220858895705522 Xgboost: 0.8404907975460123 dtree: 0.7822085889570553
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
The best 3 models above are :
1- XGBoost
2- GBM
3- Adaboost
%%time
# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,150,50),
'scale_pos_weight':[2,5,10],
'learning_rate':[0.01,0.1,0.2,0.05],
'gamma':[0,1,3,5],
'subsample':[0.8,0.9,1],
'max_depth':np.arange(1,5,1),
'reg_lambda':[5,10]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
xgb_obj = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting best parameters in RandomizedSearchCV
xgb_obj.fit(X_train,y_train)
xgb_tuned = xgb_obj.best_estimator_
xgb_tuned.fit(X_train, y_train)
CPU times: user 3.02 s, sys: 301 ms, total: 3.32 s Wall time: 28.9 s
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=1, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.05, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50, n_jobs=12,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=10,
scale_pos_weight=10, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
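The `scale_pos_weight` values searched above ([2, 5, 10]) bracket the class imbalance of the data. A rough check against the counts reported earlier (8500 existing vs 1627 attrited customers), which suggests why values around 5 are plausible:

```python
# ratio of negative (existing) to positive (attrited) samples,
# the usual starting point for XGBoost's scale_pos_weight
n_existing, n_attrited = 8500, 1627
ratio = n_existing / n_attrited
print(round(ratio, 2))  # → 5.22
```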
# Calculating different metrics on train set
xgboost_random_train = model_performance_classification_sklearn(
xgb_tuned, X_train, y_train
)
print("Training performance:")
xgboost_random_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.833251 | 0.959016 | 0.490309 | 0.648873 |
# Calculating different metrics on validation set
xgboost_random_val = model_performance_classification_sklearn(xgb_tuned, X_val, y_val)
print("Validation performance:")
xgboost_random_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.830701 | 0.944785 | 0.486572 | 0.642336 |
The model does not appear to overfit, since recall is similarly high on both the training and validation sets, but precision falls short of the range we expected.
confusion_matrix_sklearn(xgb_tuned, X_val, y_val)
There is little difference in recall between the training and validation data, but precision is still not good enough.
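If higher precision is required, one option (a sketch, not part of the original workflow) is to raise the decision threshold above the default 0.5, trading some recall for precision:

```python
import numpy as np

def predict_with_threshold(model, X, threshold=0.5):
    """Label as positive only when P(positive class) exceeds `threshold`."""
    proba = model.predict_proba(X)[:, 1]
    return (proba >= threshold).astype(int)

# e.g. predict_with_threshold(xgb_tuned, X_val, threshold=0.7)
```

Sweeping the threshold over the validation set would show the attainable precision/recall trade-offs before committing to one.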
%%time
# defining model
model = GradientBoostingClassifier(random_state=1)
# Parameter to pass in RandomizedSearchCV
param_grid={"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
gbm_obj = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
gbm_obj.fit(X_train,y_train)
gbm_tuned = gbm_obj.best_estimator_
gbm_tuned.fit(X_train, y_train)
CPU times: user 5.23 s, sys: 93.4 ms, total: 5.33 s
Wall time: 41.1 s
GradientBoostingClassifier(max_features=0.8, n_estimators=250, random_state=1,
subsample=0.9)
# Calculating different metrics on train set
gbm_random_train = model_performance_classification_sklearn(
gbm_tuned, X_train, y_train
)
print("Training performance:")
gbm_random_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.985021 | 0.927254 | 0.978378 | 0.95213 |
# Calculating different metrics on validation set
gbm_random_val = model_performance_classification_sklearn(gbm_tuned, X_val, y_val)
print("Validation performance:")
gbm_random_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.966436 | 0.861963 | 0.924342 | 0.892063 |
confusion_matrix_sklearn(gbm_tuned, X_val, y_val)
Recall decreases compared to the XGBoost model, but accuracy and precision improve.
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameters to pass to RandomizedSearchCV (note: "base_estimator" was renamed to "estimator" in scikit-learn >= 1.2)
param_rand={"base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),DecisionTreeClassifier(max_depth=3)],
"n_estimators": np.arange(10,110,10),
"learning_rate":np.arange(0.1,2,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
ab_obj = RandomizedSearchCV(estimator=model, param_distributions=param_rand, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
ab_obj.fit(X_train,y_train)
ab_tuned = ab_obj.best_estimator_
ab_tuned.fit(X_train, y_train)
CPU times: user 2.23 s, sys: 90.7 ms, total: 2.32 s
Wall time: 19.3 s
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2),
learning_rate=0.6, n_estimators=70, random_state=1)
# Calculating different metrics on train set
ab_random_train = model_performance_classification_sklearn(
ab_tuned, X_train, y_train
)
print("Training performance:")
ab_random_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.975309 | 0.891393 | 0.95186 | 0.920635 |
# Calculating different metrics on validation set
ab_random_val = model_performance_classification_sklearn(ab_tuned, X_val, y_val)
print("Validation performance:")
ab_random_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.963968 | 0.868098 | 0.904153 | 0.885759 |
confusion_matrix_sklearn(ab_tuned, X_val, y_val)
AdaBoost has the weakest performance among the three models, but its precision and accuracy are still better than XGBoost's.
# training performance comparison
models_train_comp_df = pd.concat(
[ab_random_train.T,gbm_random_train.T,xgboost_random_train.T],axis=1)
models_train_comp_df.columns = ["Adaboost Tuned with Random search",
"Gradient Boost Tuned with Random search",
"Xgboost Tuned with Random Search"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Adaboost Tuned with Random search | Gradient Boost Tuned with Random search | Xgboost Tuned with Random Search | |
|---|---|---|---|
| Accuracy | 0.975309 | 0.985021 | 0.833251 |
| Recall | 0.891393 | 0.927254 | 0.959016 |
| Precision | 0.951860 | 0.978378 | 0.490309 |
| F1 | 0.920635 | 0.952130 | 0.648873 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[ab_random_val.T,gbm_random_val.T,xgboost_random_val.T],axis=1)
models_val_comp_df.columns = ["Adaboost Tuned with Random search",
"Gradient Boost Tuned with Random search",
"Xgboost Tuned with Random Search"]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Adaboost Tuned with Random search | Gradient Boost Tuned with Random search | Xgboost Tuned with Random Search | |
|---|---|---|---|
| Accuracy | 0.963968 | 0.966436 | 0.830701 |
| Recall | 0.868098 | 0.861963 | 0.944785 |
| Precision | 0.904153 | 0.924342 | 0.486572 |
| F1 | 0.885759 | 0.892063 | 0.642336 |
XGBoost gives the best result in terms of recall, but further improvement is still needed.
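The concat-and-rename pattern used for the comparison tables above recurs several times below; a small hypothetical helper (`compare_models` is not in the original notebook) would remove the repetition:

```python
import pandas as pd

def compare_models(results, names):
    """Stack one-row metric DataFrames side by side for comparison."""
    df = pd.concat([r.T for r in results], axis=1)
    df.columns = names
    return df

# e.g. compare_models([ab_random_val, gbm_random_val, xgboost_random_val],
#                     ["Adaboost", "Gradient Boost", "Xgboost"])
```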
print("Before UpSampling, counts of label 'Attrited Customer': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'Existing Customer': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Attrited Customer': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'Existing Customer': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Attrited Customer': 976
Before UpSampling, counts of label 'Existing Customer': 5099

After UpSampling, counts of label 'Attrited Customer': 5099
After UpSampling, counts of label 'Existing Customer': 5099

After UpSampling, the shape of train_X: (10198, 21)
After UpSampling, the shape of train_y: (10198,)
xgb_tuned_over=xgb_tuned.fit(X_train_over, y_train_over) #fitting the previous model on new data set
# Calculating different metrics on train set
xgb_tunedover_train = model_performance_classification_sklearn(
xgb_tuned_over, X_train_over, y_train_over
)
print("Training performance:")
xgb_tunedover_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.807119 | 0.998627 | 0.722065 | 0.83812 |
xgb_tunedover_val = model_performance_classification_sklearn(
xgb_tuned_over, X_val, y_val
)
print("validation performance:")
xgb_tunedover_val
validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.673741 | 0.978528 | 0.327852 | 0.491147 |
Compared to the previous model, recall improved on both the training and validation sets, but precision decreased.
gbm_tuned_over=gbm_tuned.fit(X_train_over, y_train_over)#fitting the previous model on new data set
# Calculating different metrics on train set
gbm_tunedover_train = model_performance_classification_sklearn(
gbm_tuned_over, X_train_over, y_train_over
)
print("Training performance:")
gbm_tunedover_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.985389 | 0.984311 | 0.986439 | 0.985374 |
gbm_tunedover_val = model_performance_classification_sklearn(
gbm_tuned_over, X_val, y_val
)
print("validation performance:")
gbm_tunedover_val
validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.963475 | 0.889571 | 0.884146 | 0.88685 |
Compared to the gbm_tuned model, this model improved, but XGBoost still has the better recall.
ab_tuned_over=ab_tuned.fit(X_train_over, y_train_over)#fitting the previous model on new data set
# Calculating different metrics on train set
ab_tunedover_train = model_performance_classification_sklearn(
ab_tuned_over, X_train_over, y_train_over
)
print("Training performance:")
ab_tunedover_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.978917 | 0.976466 | 0.981277 | 0.978866 |
ab_tunedover_val = model_performance_classification_sklearn(
ab_tuned_over, X_val, y_val
)
print("validation performance:")
ab_tunedover_val
validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.955577 | 0.874233 | 0.853293 | 0.863636 |
The result is comparable to the gbm_tuned model, but the recall is still not as good as XGBoost's.
# training performance comparison
models_over_comp_df = pd.concat(
[ab_tunedover_train.T,gbm_tunedover_train.T,xgb_tunedover_train.T],axis=1)
models_over_comp_df.columns = ["Adaboost Tuned with Random search on oversampled data",
"Gradient Boost Tuned with Random search on oversampled data",
"Xgboost Tuned with Random Search on oversampled data"]
print("Training performance comparison:")
models_over_comp_df
Training performance comparison:
| Adaboost Tuned with Random search on oversampled data | Gradient Boost Tuned with Random search on oversampled data | Xgboost Tuned with Random Search on oversampled data | |
|---|---|---|---|
| Accuracy | 0.978917 | 0.985389 | 0.807119 |
| Recall | 0.976466 | 0.984311 | 0.998627 |
| Precision | 0.981277 | 0.986439 | 0.722065 |
| F1 | 0.978866 | 0.985374 | 0.838120 |
# Validation performance comparison
models_overval_comp_df = pd.concat(
[ab_tunedover_val.T,gbm_tunedover_val.T,xgb_tunedover_val.T],axis=1)
models_overval_comp_df.columns = ["Adaboost Tuned with Random search on oversampled data",
"Gradient Boost Tuned with Random search on oversampled data",
"Xgboost Tuned with Random Search on oversampled data"]
print("Validation performance comparison:")
models_overval_comp_df
Validation performance comparison:
| Adaboost Tuned with Random search on oversampled data | Gradient Boost Tuned with Random search on oversampled data | Xgboost Tuned with Random Search on oversampled data | |
|---|---|---|---|
| Accuracy | 0.955577 | 0.963475 | 0.673741 |
| Recall | 0.874233 | 0.889571 | 0.978528 |
| Precision | 0.853293 | 0.884146 | 0.327852 |
| F1 | 0.863636 | 0.886850 | 0.491147 |
XGBoost has the best recall among all the models, and it is slightly better on the oversampled data.
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before UnderSampling, counts of label 'Attrited Customer': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label 'Existing Customer': {} \n".format(sum(y_train == 0)))
print("After UnderSampling, counts of label 'Attrited Customer': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label 'Existing Customer': {} \n".format(sum(y_train_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label 'Attrited Customer': 976
Before UnderSampling, counts of label 'Existing Customer': 5099

After UnderSampling, counts of label 'Attrited Customer': 976
After UnderSampling, counts of label 'Existing Customer': 976

After UnderSampling, the shape of train_X: (1952, 21)
After UnderSampling, the shape of train_y: (1952,)
xgb_tuned_under=xgb_tuned.fit(X_train_un, y_train_un)#fitting the previous model on new data set
# Calculating different metrics on train set
xgb_tunedun_train = model_performance_classification_sklearn(
xgb_tuned_under, X_train_un, y_train_un
)
print("Training performance:")
xgb_tunedun_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.772541 | 0.998975 | 0.687588 | 0.814536 |
xgb_tunedun_val = model_performance_classification_sklearn(
xgb_tuned_under, X_val, y_val
)
print("validation performance:")
xgb_tunedun_val
validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.601185 | 0.981595 | 0.285205 | 0.441989 |
Precision gets worse, but recall is very good on both the training and validation data.
gbm_tuned_under=gbm_tuned.fit(X_train_un, y_train_un)#fitting the previous model on new data set
# Calculating different metrics on train set
gbm_tunedun_train = model_performance_classification_sklearn(
gbm_tuned_under, X_train_un, y_train_un
)
print("Training performance:")
gbm_tunedun_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.993852 | 0.996926 | 0.990835 | 0.993871 |
gbm_tunedun_val = model_performance_classification_sklearn(
gbm_tuned_under, X_val, y_val
)
print("validation performance:")
gbm_tunedun_val
validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.93386 | 0.941718 | 0.727488 | 0.820856 |
This model looks strong compared to all the others: although its recall is slightly lower than XGBoost's, it also has very good precision and accuracy.
ab_tuned_under=ab_tuned.fit(X_train_un, y_train_un)#fitting the previous model on new data set
# Calculating different metrics on train set
ab_tunedun_train = model_performance_classification_sklearn(
ab_tuned_under, X_train_un, y_train_un
)
print("Training performance:")
ab_tunedun_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.983094 | 0.983607 | 0.9826 | 0.983103 |
ab_tunedun_val = model_performance_classification_sklearn(
ab_tuned_under, X_val, y_val
)
print("validation performance:")
ab_tunedun_val
validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.925962 | 0.93865 | 0.701835 | 0.80315 |
This model also looks fine, but GBM is the stronger model from the scoring point of view.
# training performance comparison
models_under_comp_df = pd.concat(
[ab_tunedun_train.T,gbm_tunedun_train.T,xgb_tunedun_train.T],axis=1)
models_under_comp_df.columns = ["Adaboost Tuned with Random search on undersampled data",
"Gradient Boost Tuned with Random search on undersampled data",
"Xgboost Tuned with Random Search on undersampled data"]
print("Training performance comparison:")
models_under_comp_df
Training performance comparison:
| Adaboost Tuned with Random search on undersampled data | Gradient Boost Tuned with Random search on undersampled data | Xgboost Tuned with Random Search on undersampled data | |
|---|---|---|---|
| Accuracy | 0.983094 | 0.993852 | 0.772541 |
| Recall | 0.983607 | 0.996926 | 0.998975 |
| Precision | 0.982600 | 0.990835 | 0.687588 |
| F1 | 0.983103 | 0.993871 | 0.814536 |
# Validation performance comparison
models_underval_comp_df = pd.concat(
[ab_tunedun_val.T,gbm_tunedun_val.T,xgb_tunedun_val.T],axis=1)
models_underval_comp_df.columns = ["Adaboost Tuned with Random search on undersampled data",
"Gradient Boost Tuned with Random search on undersampled data",
"Xgboost Tuned with Random Search on undersampled data"]
print("Validation performance comparison:")
models_underval_comp_df
Validation performance comparison:
| Adaboost Tuned with Random search on undersampled data | Gradient Boost Tuned with Random search on undersampled data | Xgboost Tuned with Random Search on undersampled data | |
|---|---|---|---|
| Accuracy | 0.925962 | 0.933860 | 0.601185 |
| Recall | 0.938650 | 0.941718 | 0.981595 |
| Precision | 0.701835 | 0.727488 | 0.285205 |
| F1 | 0.803150 | 0.820856 | 0.441989 |
XGBoost has the best recall, but its precision and accuracy fall below the required threshold of 0.7.
gbm_tunedun_test = model_performance_classification_sklearn(
gbm_tuned_under, X_test, y_test
)
print("Test Data performance:")
gbm_tunedun_test
Test Data performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.930898 | 0.950769 | 0.713626 | 0.815303 |
We get almost the same result on the test set as on the validation set.
importances = gbm_tuned_under.feature_importances_
indices = np.argsort(importances)
feature_names = X_train_un.columns  # use the columns the model was trained on; data1 also contains the ID and target columns
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Orange', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Considering the results from the EDA and model building, I can suggest the following:
1- The bank should pay attention to clients with lower income and lower card levels, since they seem to get the least value from the bank's services.
2- Most clients are female, and most of the clients who are not interested in keeping their credit cards are also female, which should be taken into account.
3- Clients' credit limit plays an important role and correlates positively in our model: the higher the credit limit, the lower the chance of a client leaving the credit card services.
4- Clients' marital status also strongly affects the chance of dropping the services: 50% of all clients who are leaving are married. The bank could target offers at this category.